PATTERN ASSOCIATOR (PA)
Trained with the Hebb Rule 


Let us train a network to learn an association between
an input pattern $i_l$ and an output pattern $o_l$.
According to the Hebb Rule, the weights are:


$w_{ij} = \varepsilon\, o_{il}\, i_{jl}$   or, in vectorial form,   $W = \varepsilon\, o_l\, i_l^T$


Thus, the PA learns the weights of an association with the
outer product of the vectors to be associated.


Now, let us present a test input pattern $i_t$ and examine
the resulting output pattern it produces.


Since the units are linear, the activation of output unit
i when tested with input pattern $i_t$ is:


$o_{it} = \sum_j w_{ij}\, i_{jt}$   or, in vectorial form,   $o_t = W\, i_t$


and, substituting W with its definition:


$o_t = \varepsilon\, o_l\, i_l^T\, i_t$

and, by the associative property of the matrix product:


$o_t = \varepsilon\, o_l\, (i_l^T i_t)$


So the test output is proportional to the output of the
learned association, scaled by the inner product of the
input of the learned association with the test input vector.


LEARNING SEVERAL ASSOCIATIONS: 


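After learning trials on each of a set of pattern pairs
$(i_l, o_l)$ the weights are:


$w_{ij} = \varepsilon \sum_l o_{il}\, i_{jl}$   or, in vectorial form,   $W = \varepsilon \sum_l o_l\, i_l^T$


Thus, the output produced by a test pattern is:


$o_t = W\, i_t = \varepsilon \sum_l o_l\, (i_l^T i_t)$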


In words, the output of the network in response to
input pattern $i_t$ is the sum of the output patterns that
occurred during learning ($o_l$), with each pattern's
contribution weighted by the similarity ($i_l^T i_t$) of the
corresponding input pattern ($i_l$) to the test pattern ($i_t$).


Thus: 


In general, the output of a test is always a blend of
the training outputs, with the contribution of each output
pattern weighted by the similarity of the corresponding
input pattern to the test input pattern.
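
The blending behaviour described above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original slides: the two pattern pairs, the choice of learning constant 1 and the helper names `hebb_weights` and `recall` are illustrative assumptions.

```python
def outer(o, i):
    """Outer product o i^T, returned as a list of rows."""
    return [[oa * ib for ib in i] for oa in o]

def hebb_weights(pairs, eps=1.0):
    """W = eps * sum_l o_l i_l^T over (input, output) pairs."""
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    W = [[0.0] * n_in for _ in range(n_out)]
    for i_l, o_l in pairs:
        for r, row in enumerate(outer(o_l, i_l)):
            for c, val in enumerate(row):
                W[r][c] += eps * val
    return W

def recall(W, i_t):
    """Linear units: o_t = W i_t."""
    return [sum(w * x for w, x in zip(row, i_t)) for row in W]

# Two orthogonal, unit-length input patterns (hypothetical data).
i1, o1 = [1.0, 0.0], [1.0, -1.0, 1.0]
i2, o2 = [0.0, 1.0], [-1.0, -1.0, 1.0]
W = hebb_weights([(i1, o1), (i2, o2)])

exact = recall(W, i1)          # test input equal to i1: recovers o1
blend = recall(W, [0.5, 0.5])  # equally similar to i1 and i2: 0.5*o1 + 0.5*o2
```

With orthogonal inputs the stored output is recovered exactly; a test input that overlaps both stored inputs yields the similarity-weighted blend.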


Comments about PA:


it learns with the outer product;
it recalls with the inner product;


it can store several associations in the same net
(weight matrix) if the input vectors are
orthogonal; otherwise some hybrid patterns are
created and some of the stored patterns can no
longer be recalled;


so, in real cases the number of associations which 
can be stored is strongly limited; 


several variants and improvements have been 
proposed (e.g. Brain-State-in-a-Box, Hopfield 
Nets, etc.). 


BSB - Recall/Learning 


Recall:
$a_i(t+1) = \mathrm{Limit}\big(\alpha \sum_j w_{ij}\, a_j(t) + \beta\, a_i(t) + \gamma\, a_i(0)\big)$


where Limit(x) =
    -1   if x < -1
    +1   if x > +1
     x   if -1 <= x <= +1
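
As an illustration of the recall rule, here is a minimal Python sketch. The values of alpha, beta and gamma, the number of iterations and the 2x2 weight matrix are assumptions chosen for the example, not values given in the slides.

```python
def limit(x):
    """Clip x to the interval [-1, +1] (the 'box')."""
    return max(-1.0, min(1.0, x))

def bsb_step(W, a, a0, alpha, beta, gamma):
    """One recall step: a_i(t+1) = Limit(alpha*sum_j w_ij a_j + beta*a_i + gamma*a_i(0))."""
    return [
        limit(alpha * sum(w * aj for w, aj in zip(row, a))
              + beta * a[i] + gamma * a0[i])
        for i, row in enumerate(W)
    ]

def bsb_recall(W, a0, alpha=0.5, beta=1.0, gamma=0.0, steps=20):
    """Iterate the recall rule starting from the initial activation a0."""
    a = list(a0)
    for _ in range(steps):
        a = bsb_step(W, a, a0, alpha, beta, gamma)
    return a

# Auto-associative Hebbian weights for the stored pattern [+1, -1] (hypothetical).
W = [[1.0, -1.0], [-1.0, 1.0]]
corner = bsb_recall(W, [0.2, -0.1])  # a weak, noisy version of the stored pattern
```

The positive feedback amplifies the initial activation until it saturates at a corner of the box, here the stored pattern.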


Learning: 


Hebb rule:   $w_{ij} = \varepsilon \sum_l o_{il}\, i_{jl}$


where, for auto-associators, i = o.


The vectors to be stored must be orthogonal. If they are not
orthogonal, some hybrid patterns may be created and some of the
original associations are lost.


Delta rule (Widrow-Hoff):
$\Delta w_{ij} = \varepsilon \sum_p (t_{ip} - o_{ip})\, i_{jp}$


The vectors to be stored must be linearly independent; if the
input is noisy, the BSB tends to store an average of the
variations, thus building a "prototype" from examples.
ADALINE
Learning Algorithm


It adapts the weights on the basis of the difference
between the weighted sum (before the threshold
nonlinearity) and the target output. Each weight is
modified so as to reduce the error by a fraction C/n (n =
number of inputs, C = learning rate). The algorithm
converges to a set of weights which minimizes the mean
square error. The stability of the solution depends on C.


1. Initialize Weights and Threshold
Set $w_i(0)$ ($0 \le i \le N-1$) and $\theta$ to small random values. Here
$w_i(t)$ is the weight from input i at time t and $\theta$ is the
threshold in the output node.

2. Present New Input and Desired Output 


Present new binary input $x = x_0, x_1, \ldots, x_{N-1}$ together with the
desired output d(t).


3. Calculate Actual Output 

$\mathrm{net}(t) = \sum_i w_i(t)\, x_i(t) - \theta$          $o(t) = f_h(\mathrm{net}(t))$
4. Adapt Weights (Widrow rule) 

the weights are adapted with the formula: 

$w_i(t+1) = w_i(t) + (C/n)\,(d(t) - \mathrm{net}(t))\, x_i(t)$

where C is a positive coefficient (< 1) and d(t) is the correct
(desired) output for the current input. Notice that the weights
are modified (unlike the Perceptron rule) even if the net takes the
right decision.


5. Repeat by going to Step 2 
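
The five steps above can be sketched in Python as follows. This is an illustrative sketch only: the training set, the value of C, the number of passes and the random seed are assumptions, and the threshold is left at its initial value because the slides give an update formula for the weights only.

```python
import random

def adaline_train(samples, C=0.5, epochs=50, seed=0):
    """Widrow rule: w_i += (C/n) * (d - net) * x_i, with net taken before the threshold nonlinearity."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Step 1: small random weights and threshold.
    w = [rng.uniform(-0.1, 0.1) for _ in range(n)]
    theta = rng.uniform(-0.1, 0.1)
    for _ in range(epochs):
        # Steps 2 and 5: present each input with its desired output, repeatedly.
        for x, d in samples:
            # Step 3: weighted sum minus threshold.
            net = sum(wi * xi for wi, xi in zip(w, x)) - theta
            # Step 4: adapt every weight, even when the decision is already right.
            w = [wi + (C / n) * (d - net) * xi for wi, xi in zip(w, x)]
    return w, theta

def adaline_output(w, theta, x):
    """Hard-limited output: +1 or -1."""
    net = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if net >= 0 else -1

# Hypothetical task: the desired output equals the first bipolar input.
samples = [([1, 1], 1), ([1, -1], 1), ([-1, 1], -1), ([-1, -1], -1)]
w, theta = adaline_train(samples)
```

With bipolar inputs each update reduces the error on the presented sample by the fraction C, which is why the stability of the procedure depends on C.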
Adaline and Madaline


Coding of Adaline Inputs 


As a highly applications-oriented engineering
group, the Stanford Electronics Laboratories, under
the influence and later direction of Bernard
Widrow, were prolific in their application of this
technology to real problems. Among the several
application areas are weather forecasting, speech
recognition, vectorcardiogram diagnosis, and control
systems as illustrated by broom balancing.


One of the techniques used in several of the
adaline and madaline experiments is the use of
"linearly independent codes". This is basically a
method of decomposing a single input value into a
series of "codes" which form an invertible matrix
consisting of only 0 and 1 terms. Examples of
linearly independent codes are shown below:


Value of X       Code 1   Code 2

X < 29           10000    11111
29 <= X < 30     01000    11110
30 <= X < 31     00100    11100
31 <= X < 32     00010    11000
32 <= X          00001    10000
Input Number     12345    12345


Example: Suppose X = 29.5. Code 1 for this value
corresponds to {0,1,0,0,0}. This maps to an input
to the adaline of {-1,+1,-1,-1,-1}. This is an
essential technique for encoding information for
use by an adaline and by other paradigms.


For example, in the weather forecasting example,
which is described in more detail in a later
section, barometric pressure and the difference in
barometric pressure from the previous day are used as a basis
for doing the prediction. These are analog values
which must each be encoded into binary form using
linearly independent codes. These codes are then
concatenated to provide a binary input vector to the
adaline or madaline (with -1's replacing 0's).
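
A minimal Python sketch of this encoding follows. The function names are hypothetical, the first row of Code 2 is assumed to be 11111, and the second value (31.2) is an invented reading used only to show concatenation.

```python
# Code tables indexed by interval: X<29, 29<=X<30, 30<=X<31, 31<=X<32, 32<=X.
CODE_1 = ["10000", "01000", "00100", "00010", "00001"]
CODE_2 = ["11111", "11110", "11100", "11000", "10000"]  # first row assumed
BOUNDS = [29, 30, 31, 32]

def interval_index(x, bounds=BOUNDS):
    """Number of interval boundaries that x has reached."""
    return sum(1 for b in bounds if x >= b)

def encode(x, code=CODE_1, bounds=BOUNDS):
    """Map an analog value to its code, with -1 replacing 0."""
    bits = code[interval_index(x, bounds)]
    return [1 if b == "1" else -1 for b in bits]

single = encode(29.5)                           # Code 1 for X = 29.5
combined = encode(29.5) + encode(31.2, CODE_2)  # concatenated input vector
```

Both codes give five linearly independent rows, so each interval produces a distinct input vector for the adaline.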


HOPFIELD Net Algorithm 


1. Learning (fixed point) 


assign the connection weights according to the formula: 


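$w_{ij} = \sum_{s=0}^{M-1} x_i^s\, x_j^s$ for $i \neq j$, and $w_{ij} = 0$ for $i = j$, with $0 \le i, j \le N-1$


where $w_{ij}$ is the weight from node i to node j, M is the
number of patterns to be learned, and $x_i^s$, which may be +1
or -1, is element i of pattern s;


or, as a vectorial equation (I is the identity matrix):

$W = \sum_{s=0}^{M-1} \left( x_s\, x_s^T - I \right)$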


example: 


To store the vector [+1 -1 +1 +1], a network with
four units is used, with the weight matrix computed as follows:


$W = \begin{bmatrix} +1 \\ -1 \\ +1 \\ +1 \end{bmatrix} \begin{bmatrix} +1 & -1 & +1 & +1 \end{bmatrix} - I = \begin{bmatrix} 0 & -1 & +1 & +1 \\ -1 & 0 & -1 & -1 \\ +1 & -1 & 0 & +1 \\ +1 & -1 & +1 & 0 \end{bmatrix}$
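
The learning formula can be checked on the four-unit example with a short Python sketch. The recall step shown (a synchronous hard-limiting update) is an assumption added for illustration; the slides above describe only the learning phase.

```python
def hopfield_weights(patterns):
    """w_ij = sum_s x_i^s x_j^s for i != j, and 0 on the diagonal."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += x[i] * x[j]
    return W

def recall_step(W, state):
    """Assumed recall rule: every unit takes the sign of its weighted input."""
    return [1 if sum(w * s for w, s in zip(row, state)) >= 0 else -1 for row in W]

stored = [1, -1, 1, 1]
W = hopfield_weights([stored])
stable = recall_step(W, stored)          # the stored vector is a fixed point
repaired = recall_step(W, [1, 1, 1, 1])  # one flipped element is corrected
```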


PERCEPTRON


[Figure: single-node perceptron with inputs $x_0, \ldots$ and one output. OUTPUT: +1 = CLASS A, -1 = CLASS B]


- it is a simple net used for CLASSIFICATION

- proposed by Rosenblatt (1959) as a model which
learns to classify visual patterns

- binary or continuous input; binary output

- the single node computes a weighted sum of the input
elements, subtracts a threshold and passes the result
through a hard limiting nonlinearity such that the output
is either +1 (for class A) or -1 (for class B)
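
The computation performed by the single node can be sketched in a few lines of Python. The weights and threshold below are hypothetical values chosen so that the node separates the input (1, 1) from the other binary inputs.

```python
def perceptron_output(weights, threshold, x):
    """Weighted sum minus threshold, passed through a hard limiter."""
    net = sum(w * xi for w, xi in zip(weights, x)) - threshold
    return 1 if net >= 0 else -1  # +1 = class A, -1 = class B

# Hypothetical parameters: class A only when both inputs are 1.
weights, threshold = [1.0, 1.0], 1.5
classes = [perceptron_output(weights, threshold, x)
           for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```

The decision boundary is the line where the weighted sum equals the threshold, which is why a single node can only separate classes that lie on opposite sides of a hyperplane.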




PERCEPTRON - observations 


Rosenblatt proved that, if the input samples
come from linearly separable classes, the Perceptron
learning algorithm converges and places a
discriminant hyperplane between the two classes.


But often two classes are not separable with a
hyperplane, as in the classical case of the Exclusive OR,
proposed by Minsky as an example of the limitations
of the Perceptron.


In that case the learning algorithm leads to
oscillations or, if modified, finds a solution which
minimizes the mean square error between the actual
output and the target output.


The Perceptron's limitations are due to the simplicity
of the model. In fact, it was soon understood that more
levels would provide the model with
increased discrimination power, but there was no
training algorithm for multi-layer nets.


Recent studies have led to the demonstration of the
convergence of a training algorithm for multi-layer
perceptrons (Rumelhart et al., 1986).


MULTILAYER PERCEPTRON


[Figure: multilayer perceptron with an OUTPUT layer and HIDDEN UNITS above the input layer]





The Multi-Layer Perceptron (MLP) is a feed-forward
network with one or more layers of hidden units
between the input and output layers; usually, the units
at each level are fully connected with the following
level, without feedback;


continuous input and output in (0,1); sigmoid transfer
function;


recently (1986) a learning algorithm has been proposed
(Error Back Propagation);


the MLP overcomes the limitations of the Perceptron, and it is
promising for several applications (pattern
recognition, control, non rule-based expert systems).




MLP Learning Algorithm 
(Back-Propagation) 


It is a gradient algorithm designed to minimize the mean square
error between the actual output and the desired output of an MLP.
It repeatedly modifies the weights so as to move in the
direction opposite to the gradient of the error function
(computed with respect to the weights), performing a Steepest
Descent optimization.


1. Initialize Weights and Biases
Set all weights and node biases to small random numbers.


2. Present an example (input vector and desired output
vector)


The input is a vector of real numbers $x_0, x_1, \ldots, x_{N-1}$, to which
corresponds a vector of desired outputs $d_0, \ldots, d_{M-1}$.


If the net is used as a classifier then all desired outputs are
typically set to 0 except for the one corresponding to the class the
input is from. That desired output is 1.


3. Calculate Actual Outputs


Given a pattern p, the net input to each unit j is computed with
the formula:


$\mathrm{net}_{pj} = \sum_i w_{ji}\, o_{pi} + \theta_j$

then the output $o_{pj}$ is computed:

$o_{pj} = f(\mathrm{net}_{pj})$

where

$f(z) = \dfrac{1}{1 + e^{-z}}$
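
The output calculation can be sketched in Python as a forward pass through the layers. The network size and the weight values are hypothetical, and only the forward computation is shown; the weight-update part of back-propagation is not covered by the text above.

```python
import math

def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, theta, inputs):
    """o_j = f(sum_i w_ji * o_i + theta_j) for every unit j of one layer."""
    return [sigmoid(sum(w * o for w, o in zip(row, inputs)) + b)
            for row, b in zip(W, theta)]

def mlp_forward(layers, x):
    """Propagate the input through each (W, theta) pair in turn."""
    out = list(x)
    for W, theta in layers:
        out = layer_forward(W, theta, out)
    return out

# Hypothetical 2-2-1 network.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
output = mlp_forward(layers, [1.0, 0.0])
```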
[Figure: Types of decision regions formed by layered perceptrons. Columns: structure; types of decision regions. Entries recoverable from the figure: half plane bounded by hyperplane; convex open or closed regions; arbitrary (complexity limited by number of nodes).]

PREDICTION


STOCK MARKET


[Figure: network producing a prediction from previous values]


[Figure: ballistics example. Labels: impact point, distance to impact (Z), angle alpha, velocity (V)]


Self-Organizing Feature Maps 
Training Algorithm


Starting from random values, the weights are iteratively 
updated in such a way that nodes topologically near are 
responsive (high output) to similar inputs. Given an input vector, 
the nodes compete to learn: the weights of the winner (minimum 
distance between its weights and the input pattern) and of its 
neighbors are adapted to reinforce their output. 


1. Initialize Weights and Neighborhood radius 


Initialize the weights from the N inputs to the M output nodes to
small random values. Set the initial radius of the
neighborhood (the distance within which the nodes
topologically adjacent to the winner are updated).


2. Present new input 


Input a new vector of real numbers $x_0, \ldots, x_{N-1}$.


3. Compute Distance to all Nodes


Compute the distance $d_j$ between the input vector and each
output node j using

$d_j = \sum_{i=0}^{N-1} \left( x_i(t) - w_{ij}(t) \right)^2$

where $x_i(t)$ is the value of input node i and $w_{ij}(t)$ is
the weight from input node i to output node j at time t.
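
Steps 1 to 3 and the competition described in the introduction can be sketched in Python. The update rule used here (moving the winner and its neighbors a fraction `gain` of the way toward the input, on a one-dimensional map) is the usual Kohonen rule and is an assumption, since the update step is not given in the text above; the data are invented.

```python
def distances(x, W):
    """d_j = sum_i (x_i - w_ij)^2, where W[j] holds the weights into output node j."""
    return [sum((xi - wij) ** 2 for xi, wij in zip(x, w_j)) for w_j in W]

def winner(x, W):
    """Index of the output node with minimum distance to the input."""
    d = distances(x, W)
    return d.index(min(d))

def update(x, W, j_star, radius, gain):
    """Assumed rule: nodes within `radius` of the winner move toward the input."""
    for j, w_j in enumerate(W):
        if abs(j - j_star) <= radius:
            W[j] = [wij + gain * (xi - wij) for xi, wij in zip(x, w_j)]
    return W

# Hypothetical map with three output nodes and two inputs.
W = [[0.0, 0.0], [0.9, 0.1], [1.0, 1.0]]
x = [1.0, 0.0]
j_star = winner(x, W)
W = update(x, W, j_star, radius=1, gain=0.5)
```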
Translation of a Constraint
Satisfaction problem
to Neural Nets formalism


Unit <-> Hypothesis

Connection <-> Constraint

Weight <-> Constraint relevance

External Input <-> Direct Evidence of a Hypothesis

Unit Bias <-> A priori Probability of a Hypothesis


For example, if whenever hypothesis A is true,
hypothesis B is usually true, we would have a positive
connection from unit A to unit B. If, on the other hand,
hypothesis A provides evidence against hypothesis B,
we would have a negative connection from unit A to unit B.


The importance of the constraint is reflected by the
strength of the connection
representing the constraint. If the constraint is very
important the weights are large. Less important
constraints involve smaller weights.
3. the goodness of fit for a hypothesis when direct
evidence is available;


the goodness of fit for a hypothesis when direct
evidence is available is given by the product of
the input value times the activation value of the
unit.


The overall degree to which a unit i contributes to the
overall goodness is:


$\mathrm{goodness}_i = \sum_j w_{ij}\, a_i\, a_j + \mathrm{input}_i\, a_i + \mathrm{bias}_i\, a_i$


and, summed over all the units, the overall goodness of
fit of the net is:


$\mathrm{goodness} = \sum_i \sum_j w_{ij}\, a_i\, a_j + \sum_i \mathrm{input}_i\, a_i + \sum_i \mathrm{bias}_i\, a_i$


We have solved the problem when we have found a set of
activation values that maximizes this function.
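
A small Python sketch of the goodness function follows. The two-unit network, its weights and its external input are hypothetical, and the exhaustive search over binary activations is only feasible for tiny nets; it is used here to show what "maximizing the function" means, not as a practical method.

```python
from itertools import product

def goodness(W, a, inputs, bias):
    """sum_ij w_ij a_i a_j + sum_i input_i a_i + sum_i bias_i a_i."""
    n = len(a)
    pairwise = sum(W[i][j] * a[i] * a[j] for i in range(n) for j in range(n))
    evidence = sum(inp * ai for inp, ai in zip(inputs, a))
    prior = sum(b * ai for b, ai in zip(bias, a))
    return pairwise + evidence + prior

# Hypotheses A and B support each other; A also has direct evidence.
W = [[0, 1], [1, 0]]
inputs = [0.5, 0.0]
bias = [0.0, 0.0]

g_both = goodness(W, [1, 1], inputs, bias)
best = max(product([0, 1], repeat=2),
           key=lambda a: goodness(W, a, inputs, bias))
```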
OBSERVATIONS:


The goodness value does not increase indefinitely, but only
until the units reach their own maximal or minimal
activation values.


When the net reaches this peak in the goodness function, the
goodness can no longer change and the network is said
to have reached a stable state. The net has settled or
relaxed to a solution.


This way of updating the activation values is a Hill
Climbing method, and can be guaranteed only to find a
local maximum rather than a global maximum.
Updating can be:

synchronous / asynchronous

slow / fast

Synchronous updating may lead to oscillating states.


SIMULATED ANNEALING
(Kirkpatrick et al., 1983)


Simulated Annealing is an optimization technique to
find the global minimum of a function:


- a standard technique is steepest gradient
descent, but it can get stuck in local minima;


- simulated annealing makes use of thermal
agitation to escape from local minima.


SIMULATED ANNEALING:


there are three elements:
1. a probability density, to generate the state
transitions;
2. a temperature and a cooling policy;
3. an acceptance probability.


Starting from a random state, at each step a new state
is generated according to the transition probability.
If the new state has a lower cost, it is accepted; if it
has a higher cost, it is accepted with a certain
probability. The transition and acceptance probabilities
depend on the temperature.

The procedure is iterated while lowering the
temperature.

When the system settles, it is in a global minimum of
the cost function to be optimized.


[Figure: cost curve; vertical axis: COST]
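
The procedure can be sketched in Python on a toy problem. Several choices here are assumptions rather than part of the description above: the cost table, the neighbor move (one step left or right, independent of temperature), the Metropolis acceptance probability exp(-delta/T), the logarithmic cooling schedule T(t) = T(0)/log(1+t), and the bookkeeping of the best state seen.

```python
import math
import random

COSTS = [5, 3, 4, 6, 8, 7, 5, 2, 0, 3, 6]  # hypothetical cost of states 0..10

def cost(x):
    return COSTS[x]

def neighbor(x, rng):
    """State transition: one step left or right, kept inside the valid range."""
    return min(max(x + rng.choice([-1, 1]), 0), len(COSTS) - 1)

def simulated_annealing(cost, neighbor, start, t0=10.0, steps=2000, seed=0):
    rng = random.Random(seed)
    state = best = start
    for t in range(1, steps + 1):
        temp = t0 / math.log(1 + t)          # cooling policy (assumed)
        cand = neighbor(state, rng)
        delta = cost(cand) - cost(state)
        # Lower cost: always accept. Higher cost: accept with probability exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = cand
        if cost(state) < cost(best):         # remember the best state visited
            best = state
    return best

start = 1  # a local minimum of the table (cost 3); the global minimum is state 8
result = simulated_annealing(cost, neighbor, start)
```

Accepting some uphill moves is what allows the search to leave the local minimum at the starting state; how reliably it reaches the global minimum depends on the schedule and on the number of steps.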


SIMULATED ANNEALING
NEURAL NETWORKS (2)


It is a procedure to make the net (BM) settle
to a global minimum of its energy
(maximum of its goodness of fit with the
constraints = weights + inputs):


1. The input vector is clamped on the net.


2. The following procedure is iterated, starting at a
high temperature:


2.1. the units' activation levels are updated for some
time;


2.2. the temperature is decreased, according to a
cooling policy;
e.g.: T(t) = T(0) / log(1+t)


3. When a low temperature is reached, the outputs
are read on the units.


As the BM is a stochastic model, the output is a
probability distribution on the possible output vectors,
which may be estimated by collecting the outputs for a
certain interval of time.


There are mathematical results that guarantee that
the system will end up in a global minimum if the
system is annealed slowly enough.
The Travelling Salesman
Problem (1)


A salesman must visit a series of cities. Each city
should be visited only once. After the final city is
visited, the salesman returns to the starting city. The
distance between each pair of cities is known. What is the
shortest possible tour the salesman can make?


This problem is NP-complete; the number of
steps required by all known exact algorithms increases
exponentially with the number of cities (# of paths =
n!/2n).


Hopfield and Tank have proposed a neural
implementation to find sub-optimal solutions in real
time:


Problem description:


- for n cities, n independent sets of n neurons are needed to
represent a complete tour, one for each position in the tour.
A matrix is used where the unit $u_{X,i}$ is active (= 1) if the
city X is visited as the i-th city of the tour, and it is inactive (=
0) otherwise.

    | 1 2 3 4 5
  A | 0 1 0 0 0
  B | 0 0 0 1 0
  C | 1 0 0 0 0
  D | 0 0 0 0 1
  E | 0 0 1 0 0

- Then, we derive the weights $w_{Xi,Yj}$ for each connection.
This is done by comparing the appropriate terms in the
energy function for the Travelling Salesman problem and
the general form of the energy function.


$E = -\frac{1}{2} \sum_i \sum_j w_{ij}\, a_i\, a_j - \sum_i a_i\, \theta_i$     (general energy function for Hopfield nets)


$w_{Xi,Yj} = -A\, \delta_{XY} (1 - \delta_{ij})$     (inhibitory connections within each row)
    $-\, B\, \delta_{ij} (1 - \delta_{XY})$     (inhibitory connections within each column)
    $-\, C$     (global inhibition)
    $-\, D\, d_{XY} (\delta_{j,i+1} + \delta_{j,i-1})$     (data term)


[$\delta_{ij} = 1$ if i = j and is 0 otherwise]


$\theta_{Xi} = +C\,N$     (excitation bias)


- Finally, we let the net settle to a local minimum of
the energy, and read on the units the solution of the
problem.
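
Reading the solution off the units can be illustrated with a short Python sketch using the 5-city activation matrix shown earlier. The helper names and the distance function (cities placed on a line at positions 0 to 4) are hypothetical and serve only to make the example self-contained.

```python
import math

CITIES = "ABCDE"
# V[X][i] = 1 if city X is visited as the i-th city of the tour.
V = [
    [0, 1, 0, 0, 0],  # A
    [0, 0, 0, 1, 0],  # B
    [1, 0, 0, 0, 0],  # C
    [0, 0, 0, 0, 1],  # D
    [0, 0, 1, 0, 0],  # E
]

def is_valid_tour(V):
    """Exactly one active unit in every row (city) and every column (position)."""
    n = len(V)
    rows_ok = all(sum(row) == 1 for row in V)
    cols_ok = all(sum(V[x][i] for x in range(n)) == 1 for i in range(n))
    return rows_ok and cols_ok

def decode_tour(V, cities=CITIES):
    """For each tour position, find the city whose unit is active."""
    n = len(V)
    return [cities[[V[x][i] for x in range(n)].index(1)] for i in range(n)]

def tour_length(tour, dist):
    """Length of the closed tour, including the return to the starting city."""
    return sum(dist(tour[k], tour[(k + 1) % len(tour)]) for k in range(len(tour)))

def count_tours(n):
    """Number of distinct closed tours: n! / 2n."""
    return math.factorial(n) // (2 * n)

def line_dist(x, y):
    """Hypothetical distances: cities lie on a line at positions 0..4."""
    return abs(CITIES.index(x) - CITIES.index(y))

tour = decode_tour(V)
length = tour_length(tour, line_dist)
```

The row and column inhibition terms in the weights exist to push the net toward activation matrices for which `is_valid_tour` would hold, while the data term favors short tours.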
VECTORIAL Inner Product


The INNER PRODUCT of two vectors is the sum of the
products of the corresponding elements of the vectors.


$v = \begin{bmatrix} 3 \\ -1 \\ 2 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}$

$v^T w = \begin{bmatrix} 3 & -1 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} = (3 \cdot 1) + (-1 \cdot 2) + (2 \cdot 1) = 3$


GEOMETRIC INTERPRETATION in 2D:


the inner product of two vectors is connected with their
projection, and gives a measure of the similarity of the
vectors:


$v^T w = \|v\|\, \|w\| \cos\theta$


$x = \|v\| \cos\theta = \dfrac{v^T w}{\|w\|}$     (x is the length of the projection of v onto w)
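MATRIX
INNER PRODUCT
and
OUTER PRODUCT


$v^T u = \begin{bmatrix} 3 & 1 & 2 \end{bmatrix} \begin{bmatrix} 0 \\ 4 \\ 1 \end{bmatrix} = [6] = 6$     (INNER PRODUCT)

$u\, v^T = \begin{bmatrix} 0 \\ 4 \\ 1 \end{bmatrix} \begin{bmatrix} 3 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 12 & 4 & 8 \\ 3 & 1 & 2 \end{bmatrix}$     (OUTER PRODUCT)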
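
Both products can be reproduced with a few lines of plain Python. The function names are illustrative; the vectors are those of the worked examples (v = [3, 1, 2] with u = [0, 4, 1], and [3, -1, 2] with [1, 2, 1]).

```python
import math

def inner(v, w):
    """Inner product: sum of the products of corresponding elements (a scalar)."""
    return sum(a * b for a, b in zip(v, w))

def outer(u, v):
    """Outer product u v^T: a matrix with one row per element of u."""
    return [[a * b for b in v] for a in u]

def projection_length(v, w):
    """Length of the projection of v onto w: ||v|| cos(theta) = v.w / ||w||."""
    return inner(v, w) / math.sqrt(inner(w, w))

v, u = [3, 1, 2], [0, 4, 1]
scalar = inner(v, u)        # recall in the pattern associator uses this
matrix = outer(u, v)        # learning in the pattern associator uses this
similarity = inner([3, -1, 2], [1, 2, 1])
```

The inner product collapses two vectors into one number measuring their similarity, while the outer product expands them into a full matrix, which is the shape needed for a weight matrix.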


